default reading order fallbacks via TOC #46

Closed
wants to merge 3 commits into from


iherman
Member

@iherman iherman commented Aug 23, 2017

This is my attempt to extract from a number of issues (#35, #36, #39) some aspects of TOC vs. reading orders that may represent a level of consensus. I was mostly inspired by the comment of @baldurbjarnason (#36 (comment)) which described that there may be a fallback both to a TOC if a default reading order is not present, and on default reading orders if a TOC is not present, although both are listed as separate information in the WP Infoset.

Obviously, the trigger for all this was also #35



@HadrienGardeur
Member

I'm against the idea of an HTML TOC being a fallback for the primary reading order.

These are two completely separate concepts, and while I feel that the primary reading order may be a fallback for a TOC, it's not true the other way around without additional requirements from the TOC.

@iherman
Member Author

iherman commented Aug 23, 2017

Adding additional requirements on a TOC for a fallback is perfectly fine, @HadrienGardeur. Let us formulate those and get them documented. The goal remains to allow very simple WP-s to be created easily, while letting the manifest control the complex cases.

@HadrienGardeur
Member

@iherman but these additional requirements on a TOC only make sense when the TOC is a fallback. I'm not a big fan of that, it's better to have must/should/may for the TOC that are unrelated to its fallback status.

@mattgarrish
Member

And these fallbacks seem to get more brittle as we try to construct ever more complex requirements. It's garbage in, garbage out processing.

I'm beginning to think these don't belong with the infoset but are guidance purely for a user agent to construct for an invalid publication. Legitimizing this for lazy authoring is a bad idea.

Anyway, some offered changes and thoughts.

As Hadrien has pointed out before, I believe, the same primary resource can occur multiple times legitimately, so you only want to ignore consecutive references to the same resource.

If the default reading order is not specified in the manifest, the user agent MUST construct it from the table of contents, if a table of contents is provided. The default reading order becomes the list of primary resources linked to from the table of contents, in the order they occur, and after removing any fragment identifiers and de-duplicating any consecutive resources that remain.
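The proposed wording above can be sketched in a few lines of Python. This is only an illustration of the "strip fragments, then collapse consecutive repeats" rule; the function name and the sample file names are hypothetical, and skipping fragment-only links (which point back into the TOC document itself) is an assumption not stated in the text.

```python
from urllib.parse import urldefrag


def reading_order_from_toc_links(hrefs):
    """Derive a reading order from TOC links, in the order they occur.

    Fragment identifiers are removed first; then only *consecutive*
    references to the same resource are collapsed, so a resource that
    legitimately recurs later in the TOC is kept.
    """
    order = []
    for href in hrefs:
        url, _fragment = urldefrag(href)
        if not url:
            # Assumption: a fragment-only link ("#top") is ignored.
            continue
        if not order or order[-1] != url:
            order.append(url)
    return order


# Hypothetical TOC links, for illustration only.
toc_links = [
    "intro.html",
    "ch1.html#s1",
    "ch1.html#s2",
    "notes.html",
    "ch1.html#s3",
    "ch2.html",
]
print(reading_order_from_toc_links(toc_links))
# ['intro.html', 'ch1.html', 'notes.html', 'ch1.html', 'ch2.html']
```

Note that `ch1.html` appears twice in the result: the second occurrence is not consecutive with the first, so it survives.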

The table of contents generation sounds like a bit of a fishing expedition in the hopes that it is there but the user forgot to link to it. It seems unlikely the UA will find a nice doc-toc marked nav element, though. (If only the first one found is used, then you can skip saying to disregard other ones.)

In terms of a fallback that is reasonably assured to be there, though, I thought the current thinking was to take the titles of the primary resources in the reading order? You'll at least get some kind of table of contents for purely image-based works, since you can fall back to the url or file name.
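The title-based fallback described above might look roughly like the following. It is a sketch under assumptions: the resource records (dicts with `url` and an optional `title`) and the function name are hypothetical, not part of any draft.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def fallback_toc(reading_order):
    """Build (label, url) TOC entries from the resources in the reading order.

    The label is the resource title when one is known; otherwise the file
    name taken from the URL path; otherwise the URL itself.
    """
    entries = []
    for resource in reading_order:
        url = resource["url"]
        label = resource.get("title")
        if not label:
            file_name = PurePosixPath(urlparse(url).path).name
            label = file_name or url
        entries.append((label, url))
    return entries


# Hypothetical image-based publication: only the first resource has a title.
pages = [
    {"url": "https://example.org/book/cover.html", "title": "Cover"},
    {"url": "https://example.org/book/page-002.png"},
    {"url": "https://example.org/book/"},
]
for label, url in fallback_toc(pages):
    print(label, "->", url)
```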

@mattgarrish
Member

mattgarrish commented Aug 23, 2017

And just to be clear, what I mean by lazy authoring is not that there may be legitimate cases where this information could be reliably constructed. Rather, once we start assuming things will fall nicely into place because of certain scenarios we can envisage we end up with a very permissive specification that enables bad experiences for users because authors are not aware of why the affordances are there.

We should have very tight rules and call this implicit information gathering (e.g., you don't have to list the default reading order, but you must set a "fromToc" flag) and/or push the processing into failure handling.
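As a rough sketch of the distinction drawn above, the function below separates three cases: an explicit reading order, an explicit opt-in via the "fromToc" flag mentioned in the comment, and everything else, which is pushed to failure handling. The `readingOrder` key and the returned labels are hypothetical.

```python
def select_reading_order_source(manifest):
    """Decide where the default reading order should come from.

    Returns one of:
      "manifest"          the author listed the reading order explicitly
      "toc"               the author opted in with the "fromToc" flag
      "failure-handling"  neither; the user agent is on its own
    """
    if manifest.get("readingOrder"):
        return "manifest"
    if manifest.get("fromToc") is True:
        return "toc"
    return "failure-handling"


print(select_reading_order_source({"readingOrder": ["ch1.html", "ch2.html"]}))  # manifest
print(select_reading_order_source({"fromToc": True}))                           # toc
print(select_reading_order_source({}))                                          # failure-handling
```

The point of the flag is that deriving the order from the TOC becomes a recorded authoring decision rather than something a user agent silently assumes.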

@GarthConboy
Contributor

It does sound like there is general agreement that the HTML-supplied TOC could be a fallback for the case where the manifest doesn't include a specific TOC. So, maybe this PR should be split in half, getting that part in?

I waffle on HTML TOC as fallback for manifest-resident primary reading order. Two competing arguments:

-- Plus: makes authoring simple WPs easy, perhaps obviating the need for a JSON manifest for very simple WPs.
-- Minus: The two are pretty different concepts, and "perhaps obviating the need for a JSON manifest for very simple WPs" is a bug that could lead to two valid types of "manifests".

This morning I was more on the plus side, now I'm more on the minus side.

@HadrienGardeur
Member

I think @mattgarrish is making a good point that some of these fallbacks should be candidates for failure handling rather than the normal way we process a manifest.

Right now we have a very reasonable list of requirements for our infoset and I'm not comfortable with fallbacks that can barely handle them, especially for the primary reading order.

The lack of a list of primary resources in a manifest also raises the following question: if I discover the first chapter of a publication, and through it the manifest, how do I locate the TOC if the manifest does not contain any primary resource?

Do I need to rely on well-known location or the presence of a specific link in the content to figure out where the TOC lives?

This doesn't sound like "easy authoring" to me, and I'm not even talking about all the issues around processing a TOC to extract a reading order from it.

@iherman changed the title from "Toc/default reading order fallbacks via HTML" to "default reading order fallbacks via TOC" on Aug 24, 2017
@iherman
Member Author

iherman commented Aug 24, 2017

Admin: I agree with the #46 (comment) of @GarthConboy : it was a mistake to lump two issues into one PR. So I split it:

  1. This PR is on the section on default reading order, that is generated, as fallback, from a TOC
  2. PR #47 ("Retrieving a TOC from HTML files") is on the TOC being found in an HTML resource if not explicit in the manifest

@iherman
Member Author

iherman commented Aug 24, 2017

@GarthConboy summarized it as:

I waffle on HTML TOC as fallback for manifest-resident primary reading order. Two competing arguments:

-- Plus: makes authoring simple WPs easy, perhaps obviating the need for a JSON manifest for very simple WPs.
-- Minus: The two are pretty different concepts, and "perhaps obviating the need for a JSON manifest for very simple WPs" is a bug that could lead to two valid types of "manifests".

I would not dismiss the second point. Instead of being very general, I take a document that, in my view, must be considered as a candidate for a WP, namely the HTML51 spec[1]. Although there is a single file version of the spec, the basic one is cut into a moderately large number of HTML files (35, to be precise). It also has a fairly detailed and long TOC which is, actually, repeated in all HTML files. The fact that it is repeated over all HTML files is a matter of the current Rec styling; a previous version of HTML[2] had its full TOC in a single file, namely Overview.html[3].

My goal is this: I would like to avoid the producers of the HTML spec having to maintain a separate reading order in a manifest file. Or are we saying that, no matter what, authors must produce this? I sense significant push-back coming from the editors of a document like HTML.

Looking at the TOC right now[4] the rough algorithm described in the draft works fairly well: extract the URL-s in order from the TOC, and remove duplicates. What you get is the default reading order…

  [1] https://www.w3.org/TR/html51/
  [2] https://www.w3.org/TR/2014/REC-html5-20141028/
  [3] https://www.w3.org/TR/2014/REC-html5-20141028/Overview.html#contents
  [4] https://www.w3.org/TR/html/Overview.html
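The rough algorithm described in this comment (extract the URL-s in order from the TOC, and remove duplicates) could be sketched as below using only the Python standard library. Everything here is illustrative: the class and function names are hypothetical, the sample markup and file names are made up rather than copied from the HTML spec, and treating every `nav` element as the TOC is a simplifying assumption. Unlike the consecutive-only de-duplication suggested earlier in the thread, this variant drops every repeated URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class TocLinkCollector(HTMLParser):
    """Collect href values of <a> elements that occur inside a <nav>."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a" and self.nav_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def reading_order_from_toc_html(html, base_url):
    """Resolve TOC links against base_url, drop fragments, remove duplicates."""
    collector = TocLinkCollector()
    collector.feed(html)
    urls = [urldefrag(urljoin(base_url, href)).url for href in collector.hrefs]
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    return list(dict.fromkeys(urls))


# Made-up TOC fragment; the base URL is a placeholder as well.
SAMPLE = """
<nav id="toc">
  <ol>
    <li><a href="introduction.html#introduction">1. Introduction</a></li>
    <li><a href="introduction.html#background">1.1 Background</a></li>
    <li><a href="infrastructure.html#infrastructure">2. Common infrastructure</a></li>
    <li><a href="dom.html#dom">3. Semantics</a></li>
  </ol>
</nav>
<p><a href="https://example.org/elsewhere">not part of the TOC</a></p>
"""

for url in reading_order_from_toc_html(SAMPLE, "https://example.org/spec/Overview.html"):
    print(url)
```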

@HadrienGardeur
Member

@iherman this is only one example, not all table of contents look the same.

Since we haven't defined at all what a TOC must/should/may contain, this feels extremely premature to have a TOC as an implicit fallback.

@iherman
Member Author

iherman commented Aug 24, 2017 via email

@HadrienGardeur
Member

@iherman

It's not the only issue that I have with it, but the fact that we haven't even defined:

  • what a TOC truly is
  • the requirements for a TOC
  • how we identify a TOC

... is enough to at least postpone this discussion.

If a manifest doesn't contain a list of primary resources or a title, this means that we could have a WP without a manifest, which brings additional issues (How do we establish WP-ness without a manifest? How do we locate a TOC without a manifest?) that must also be resolved before any of this.

@iherman
Member Author

iherman commented Aug 24, 2017

@HadrienGardeur as I said, I am fine postponing the discussion but not closing it. Maybe I am naïve, but I do not see such major issues, because I am also fine restricting the usable versions of TOC to the least complex ones.

Also: I would be fine saying that WP-ness requires the presence of a manifest. If its only role, in a specific case, is to ensure the declaration of being a WP, that is fine with me. What I want to avoid is that, in simple cases, the author would have to repeat the same information several times: I do not think that would catch on.

But again: postponement is fine.

@HadrienGardeur
Member

Also: I would be fine saying that WP-ness requires the presence of a manifest. If its only role, in a specific case, is to ensure the declaration of being a WP, that is fine with me.

I'm not sure that a link to an empty document is a great idea though. Same thing for an empty script element.

@BigBlueHat
Member

@HadrienGardeur this has been mentioned vaguely a few times: "this is only one example, not all table of contents look the same." from #46 (comment)

Would you be able to find examples (print, digital, or whatever) that would present something unique or (more importantly) would prevent a reading order from being deduced?

The more concrete the examples, the more concrete the spec.

@HadrienGardeur
Member

I've seen such publications before but publishers like @laudrain might have an easier time than me providing such examples quickly.

@HadrienGardeur
Member

I'll also cc @llemeurfr and @JayPanoz in here, they might be able to contact content producers to obtain examples for:

  • publications where the table of contents points to non-linear resources in EPUB
  • publications where the TOC has a hierarchy that differs from the reading order

It's not hard to find publications that skip primary resources, there are a lot of examples out there in EPUB.

I know that @baldurbjarnason has provided examples for non-linear publications before, these might also be relevant here.

@BigBlueHat
Member

Thanks, @TzviyaSiegman. 😃

Thanks everyone for the examples! Sorry I requested them here... Let's do move them (and add more!) to the ToC-Samples wiki page.

...we now return this issue to its actual intended use. 😁

@iherman
Member Author

iherman commented Aug 25, 2017

If the TOC is "non-linear", then we do need the reading order explicitly in the manifest. What this tells me, and I believe we can agree on that, is that the manifest MUST have a slot for a default reading order, and we cannot rely exclusively on the TOC in one of the resources. This also means that the authors of, say, cookbooks or travel books as WP-s should be aware of that, and they MUST provide an explicit reading order in the manifest.

However. The question is what is the percentage of such Web Publications among all Web Publications. Although we do not necessarily have empirical evidence, I believe that most of the WP-s would be much more "regular", whereby I mean that TOC entries follow the regular reading order, although they would refine that greatly, providing, e.g., links to individual sections within some of the resources. I would even go as far as saying that the vast majority of WP-s would be like that.

Taking the vast majority into account, a fallback on those continues to make sense to me, I must say. After all, fallback means "use this if the authoritative mechanism is not provided", where the authoritative mechanism is to have the default order in the manifest.

@iherman
Member Author

iherman commented Aug 25, 2017 via email

@mattgarrish
Member

where the authoritative mechanism is to have the default order in the manifest

To repeat myself, I'm sure, but can't we make a distinction between authoring intent and fallback processing?

For example: The reading order must be included in the manifest, which is either an ordered list of primary resources or a link to an html nav element that contains such a list of links (needing these to be clearly different links, of course).

These are the accepted ways to provide a reading order, but a reading system could piece one together by searching for a toc nav, inspecting the list of resources for certain media types (all html documents), etc.

@iherman
Member Author

iherman commented Aug 25, 2017

@mattgarrish,

For example: The reading order must be included in the manifest, which is either an ordered list of primary resources or a link to an html nav element that contains such a list of links (needing these to be clearly different links, of course).

These are the accepted ways to provide a reading order, but a reading system could piece one together by searching for a toc nav, inspecting the list of resources for certain media types (all html documents), etc.

I may have missed this before, but I actually like this approach. In other words, at least for reading order but also for the TOC (and maybe for other things) we do something more explicit like this. It is also more efficient, I suppose. I.e.,

  • The manifest SHOULD include a TOC, which may either be a list of links, or a link to a nav element in a primary resource (with the proper role)
  • The manifest MUST include a default reading order, which again may be either a list of links or a link to a nav element in a primary resource (with the proper role); in this case the extra rule is that the links extracted from the nav element would be pruned to remove fragment IDs and duplicate entries
  • The manifest MUST include a title, which is either a string or a link to a title element.

And, for each case, the fallback is not normatively defined but left to the UA (using the text you have already here and there).

The extra load on the author may then become minimal. To use the HTML5 Spec example, it would involve a small JSON file with 2-3 links. No big deal (as opposed to repeating the TOC, for example).
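To make the three rules listed above concrete, here is a minimal sketch of a manifest check. The property names (`title`, `readingOrder`, `toc`), the representation of a link as a dict with an `href`, and the message strings are all assumptions made for illustration; no manifest syntax had been agreed at this point in the thread.

```python
def is_link(value):
    """A link is modelled here as a dict with a non-empty "href"."""
    return isinstance(value, dict) and bool(value.get("href"))


def is_link_list(value):
    """An explicit, non-empty list of resources."""
    return isinstance(value, list) and len(value) > 0


def check_manifest(manifest):
    """Return (errors, warnings) for the MUST and SHOULD rules above."""
    errors, warnings = [], []

    reading_order = manifest.get("readingOrder")
    if not (is_link_list(reading_order) or is_link(reading_order)):
        errors.append("readingOrder: MUST be a list of links or a link to a nav element")

    title = manifest.get("title")
    if not ((isinstance(title, str) and title.strip()) or is_link(title)):
        errors.append("title: MUST be a string or a link to a title element")

    toc = manifest.get("toc")
    if not (is_link_list(toc) or is_link(toc)):
        warnings.append("toc: SHOULD be a list of links or a link to a nav element")

    return errors, warnings


# The "small JSON file with 2-3 links" case, written as a Python dict.
manifest = {
    "title": {"href": "Overview.html"},
    "readingOrder": {"href": "Overview.html#contents"},
    "toc": {"href": "Overview.html#contents"},
}
print(check_manifest(manifest))  # ([], [])
```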

My first reaction is: I think I like that.

I am not sure about the language, though...

@HadrienGardeur
Member

@iherman

The manifest SHOULD include a TOC, which may either be a list of links, or a link to a nav element in a primary resource (with the proper role)

The TOC could also be a secondary resource IMO, it doesn't have to be in the reading order of the publication.

@mattgarrish
Member

Primary resource is decoupled from reading order (at least for now). Primary is one that is not nested within another (top-level), so it would account for a toc outside the reading order.

@HadrienGardeur
Member

@mattgarrish well I missed the part when primary resources became decoupled from the reading order...

I'm not a fan of talking about nesting, I would much rather have:

  • primary resources = reading order
  • secondary resources = everything else that's within the boundaries of the publication
  • external resources = everything referenced by primary or secondary resources that won't be packaged or cached for offline reading
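The three-way split proposed above is easy to state as code. The sketch below is only an illustration of that proposal (not of the "nesting" definition it argues against); the function name and sample URLs are hypothetical.

```python
def classify_resource(url, reading_order, publication_resources):
    """Classify a URL under the proposed primary/secondary/external split.

    primary    the resource is in the reading order
    secondary  anything else within the boundaries of the publication
    external   referenced, but not packaged or cached for offline reading
    """
    if url in reading_order:
        return "primary"
    if url in publication_resources:
        return "secondary"
    return "external"


reading_order = ["ch1.html", "ch2.html"]
# Everything within the boundaries of the publication, reading order included.
publication_resources = {"ch1.html", "ch2.html", "toc.html", "style.css", "fig1.png"}

for url in ["ch1.html", "toc.html", "https://example.org/video.mp4"]:
    print(url, "->", classify_resource(url, reading_order, publication_resources))
```

Under this model a TOC kept outside the reading order (`toc.html` here) is simply a secondary resource.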

@HadrienGardeur
Member

@iherman

The manifest MUST include a default reading order, which again may be either a list of links or a link to a nav element in a primary resource (with the proper role); in this case the extra rule is that the links extracted from the nav element would be pruned to remove fragment IDs and duplicate entries

I'm still unhappy about that. Aside from TOCs that do not follow the reading order, reference fragments and repeat references to primary resources, there are other issues that are not addressed here:

  • a TOC pointing to secondary resources (which is the equivalent of pointing to a non-linear document in EPUB)
  • resources in the reading order that are not referenced by the TOC (usually preliminary contents or very large chapters/sections that are divided into multiple resources)

There are too many situations where a TOC does not contain the reading order, I really don't think it can be trusted as a fallback.
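The two unaddressed cases listed above can be detected mechanically when an explicit reading order is available for comparison. The following sketch, with hypothetical names and sample data, reports TOC targets that fall outside the reading order and reading-order resources the TOC never mentions.

```python
from urllib.parse import urldefrag


def toc_vs_reading_order(toc_hrefs, reading_order):
    """Compare TOC link targets with an explicit reading order.

    Returns (outside, unlisted):
      outside   TOC targets that are not in the reading order
                (for example, links to secondary resources)
      unlisted  reading-order resources the TOC never links to
                (for example, front matter or split chapters)
    """
    toc_targets = list(dict.fromkeys(urldefrag(href).url for href in toc_hrefs))
    in_order = set(reading_order)
    in_toc = set(toc_targets)
    outside = [url for url in toc_targets if url not in in_order]
    unlisted = [url for url in reading_order if url not in in_toc]
    return outside, unlisted


# Hypothetical publication: chapter 1 is split in two, and the TOC links to a glossary.
reading_order = ["titlepage.html", "ch1-part1.html", "ch1-part2.html", "ch2.html"]
toc_hrefs = ["ch1-part1.html#ch1", "ch2.html#ch2", "glossary.html"]

outside, unlisted = toc_vs_reading_order(toc_hrefs, reading_order)
print("TOC targets outside the reading order:", outside)
print("Reading-order resources missing from the TOC:", unlisted)
```

A non-empty result for either list means a reading order derived from this TOC would differ from the one the author intended.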

@mattgarrish
Member

It was in the last PR. We had two problems with the definitions: one is that primary being reading order and reading order being primary is circular. The other, as we discussed in another thread, is that secondary is tied to being part of the rendering of a primary.

If primary are only resources in the reading order, there's almost no point in having a distinction. There are just resources of which some are in the reading order. The purpose of the rest is indeterminate until encountered. I just don't like this as far as any instruction about what needs to be in the manifest v. what is optional. Eventually it has to come around to some interpretation of standalone resource v. helper resource.

But this PR probably isn't the best place to discuss.

@BillKasdorf

We also need to remember to qualify this as being about WPs that consist of more than one primary resource. As Benjamin reminded us, scholarly journal articles will surely be WPs and most consist of a single primary document, for which reading order and TOC are meaningless. There are millions of those out there. . . .

@HadrienGardeur
Member

@BillKasdorf

Such a publication would have a single resource in the list of primary resources: problem solved.

@BillKasdorf

Exactly. List of primary resources, always needed. Reading order or TOC, not always.

@HadrienGardeur
Member

@mattgarrish probably not the best place, but this is completely tied to our discussion here...

If you follow my suggestion for primary vs secondary vs external, then secondary resources are useful for:

  • packaging
  • caching

I really dislike the notion of separating primary resources from the reading order, I don't see the point aside from making things more complex than they should be.

For secondary resources, treating them as "sub" or "nested" resources is fairly useless without knowing which resources reference them. We'd be better off just listing all resources that are not in the reading order.

@llemeurfr
Contributor

This fork of the initial issue topic should IMO be moved to #16 (non-linear resources - primary, secondary or something else), where it seems it belongs.

@iherman
Member Author

iherman commented Aug 25, 2017

I must say I am at a loss, @HadrienGardeur, because I do not see the problem. If, for whatever reason, the TOC in a resource is not appropriate for that purpose, then the author can (and should) decouple it from the reading order. In the new scheme (thanks to @mattgarrish) the author would use the manifest in its full beauty to define the reading order. The only extra feature provided is that if the TOC is fine, then a pointer to the TOC suffices.

In other words, for it to be valid, the manifest MUST include an explicit default reading order. The only thing is that the value of that default reading order may be a link to a TOC rather than a list of resources. It is (it must be) a conscious decision of the author (no automatic fallback).
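A sketch of how a user agent might resolve such a value follows. It assumes a hypothetical `readingOrder` property that is either a list of resources or a link (modelled as a dict with an `href`) to a TOC, and a caller-supplied `load_toc_hrefs` function that returns the links found in that TOC. A manifest with neither is treated as invalid rather than triggering any automatic fallback.

```python
from urllib.parse import urldefrag


def resolve_reading_order(manifest, load_toc_hrefs):
    """Resolve the explicit default reading order declared in a manifest.

    load_toc_hrefs(href) is supplied by the caller and returns the list of
    link targets found in the TOC that the manifest points to.
    """
    value = manifest.get("readingOrder")
    if isinstance(value, list) and value:
        # The author listed the resources directly.
        return list(value)
    if isinstance(value, dict) and value.get("href"):
        # The author consciously pointed at a TOC: prune fragments and duplicates.
        hrefs = load_toc_hrefs(value["href"])
        return list(dict.fromkeys(urldefrag(href).url for href in hrefs))
    raise ValueError("manifest does not declare a default reading order")


def fake_loader(href):
    """Stand-in for fetching and parsing the TOC; returns made-up links."""
    return ["ch1.html#a", "ch1.html#b", "ch2.html"]


print(resolve_reading_order({"readingOrder": ["x.html", "y.html"]}, fake_loader))
print(resolve_reading_order({"readingOrder": {"href": "Overview.html#contents"}}, fake_loader))
```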

@mattgarrish
Member

@HadrienGardeur Yes, I can still live with where we ended up in #16. Nothing in the world is going to make someone list a major content resource omitted from the reading order, whatever we call it. That was the distinction I was trying to eke out is that some resources are still more important than others, even if not listed.

@mattgarrish
Member

If the author explicitly links to a toc, then they've made the decision it is okay. That's how I can live with that.

As far as fallbacks, I don't think we should ever mandate an algorithm. Finding a toc nav in a primary resource and using it is just one option of what a UA might choose to do. Same with how title/language are worded.
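For illustration only, here is what one such non-mandated option could look like: scan the resources in order and stop at the first one containing a `nav` element whose role includes `doc-toc`. The names and the sample documents are hypothetical, and nothing here is meant as a required algorithm.

```python
from html.parser import HTMLParser


class TocNavDetector(HTMLParser):
    """Set .found when a <nav> element carries the doc-toc role."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            roles = (dict(attrs).get("role") or "").split()
            if "doc-toc" in roles:
                self.found = True


def first_resource_with_toc(resources):
    """resources: iterable of (url, html_text) pairs, in the order to search.

    Returns the URL of the first resource containing a doc-toc nav, or None.
    """
    for url, html in resources:
        detector = TocNavDetector()
        detector.feed(html)
        if detector.found:
            return url
    return None


documents = [
    ("cover.html", "<h1>Title</h1>"),
    ("contents.html", '<nav role="doc-toc"><ol><li><a href="ch1.html">One</a></li></ol></nav>'),
    ("ch1.html", '<nav role="doc-toc"><a href="ch1.html#s1">Section</a></nav>'),
]
print(first_resource_with_toc(documents))  # contents.html
```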

@HadrienGardeur
Member

Fully agree with @mattgarrish.

@GarthConboy
Contributor

GarthConboy commented Aug 25, 2017

It does sound like we're now on roughly the same, er, page here -- excellent.

And, just to provide an explicit thumbs up to a comment from @HadrienGardeur above:

primary resources = reading order
secondary resources = everything else that's within the boundaries of the publication
external resources = everything referenced by primary or secondary resources that won't be packaged or cached for offline reading

If an author explicitly references a nav element as the reading order (from the manifest), so be it, and those resources (with requisite trimming) are thus the primary ones.

@BillKasdorf

Okay by me as long as we remember to say default reading order.

@mattgarrish
Member

mattgarrish commented Aug 26, 2017

Closing in order to re-open consolidated PR. See #51.


10 participants